Abstract

It is challenging to translate names and
technical terms across languages with different
alphabets and sound inventories. These
items are commonly transliterated, i.e., replaced
with approximate phonetic equivalents.
For example, computer in English comes out
as コンピューター (konpyuutaa) in Japanese.
Translating such items from Japanese back to
English is even more challenging, and of prac- 
tical interest, as transliterated items make up 
the bulk of text phrases not found in bilin- 
gual dictionaries. We describe and evaluate a 
method for performing backwards translitera- 
tions by machine. This method uses a gen- 
erative model, incorporating several distinct 
stages in the transliteration process. 

1 Introduction 

Translators must deal with many problems, and 
one of the most frequent is translating proper 
names and technical terms. For language pairs 
like Spanish/English, this presents no great challenge:
a phrase like Antonio Gil usually gets translated
as Antonio Gil. However, the situation is
more complicated for language pairs that employ 
very different alphabets and sound systems, such 
as Japanese/English and Arabic/English. Phonetic 
translation across these pairs is called translitera- 
tion. We will look at Japanese/English translitera- 
tion in this paper. 

Japanese frequently imports vocabulary from 
other languages, primarily (but not exclusively) 
from English. It has a special phonetic alphabet 
called katakana, which is used primarily (but not
exclusively) to write down foreign names and loan- 
words. To write a word like golfbag in katakana, 
some compromises must be made. For example, 
Japanese has no distinct L and R sounds: the two En- 
glish sounds collapse onto the same Japanese sound. 
A similar compromise must be struck for English 
H and F. Also, Japanese generally uses an alternating
consonant-vowel structure, making it impossible
to pronounce LFB without intervening vowels.
Katakana writing is a syllabary rather than an
alphabet: there is one symbol for ga (ガ), another
for gi (ギ), another for gu (グ), etc. So the way
to write golfbag in katakana is ゴルフバッグ, roughly
pronounced goruhubaggu. Here are a few more
examples:

Angela Johnson 
(anjira jyonson) 

New York Times
ニューヨーク・タイムズ
(nyuuyooku taimuzu)

ice cream 
(aisukuriimu) 

Notice how the transliteration is more phonetic than
orthographic; the letter h in Johnson does not
produce any katakana. Also, a dot-separator ( • ) is
used to separate words, but not consistently. And
transliteration is clearly an information-losing
operation: aisukuriimu loses the distinction between
ice cream and I scream.

Transliteration is not trivial to automate, but
we will be concerned with an even more challenging
problem: going from katakana back to English,
i.e., back-transliteration. Automating
back-transliteration has great practical importance in
Japanese/English machine translation. Katakana
phrases are the largest source of text phrases that 
do not appear in bilingual dictionaries or training 
corpora (a.k.a. "not-found words"). However, very 
little computational work has been done in this area; 
(Yamron et al., 1994) briefly mentions a pattern- 
matching approach, while (Arbabi et al., 1994) dis- 
cuss a hybrid neural-net/expert-system approach to 
(forward) transliteration. 

The information-losing aspect of transliteration
makes it hard to invert. Here are some problem
instances, taken from actual newspaper articles:¹

¹ Texts used in ARPA Machine Translation evaluations,
November 1994.



? (aasudee)

? (robaato shyoon renaado)

? (masutaazutoonamento)

English translations appear later in this paper. 

Here are a few observations about back- 
transliteration: 

• Back-transliteration is less forgiving than 
transliteration. There are many ways to write 
an English word like switch in katakana, all 
equally valid, but we do not have this flexibility 
in the reverse direction. For example, we cannot
drop the t in switch, nor can we write arture
when we mean archer.

• Back-transliteration is harder than romanization,
which is a (frequently invertible) transformation
of a non-roman alphabet into roman
letters. There are several romanization
schemes for katakana writing — we have already 
been using one in our examples. Katakana 
writing follows Japanese sound patterns closely, 
so katakana often doubles as a Japanese pro- 
nunciation guide. However, as we shall see, 
there are many spelling variations that compli- 
cate the mapping between Japanese sounds and 
katakana writing. 

• Finally, not all katakana phrases can be 
"sounded out" by back-transliteration. Some 
phrases are shorthand, e.g., ワープロ (waapuro)
should be translated as word processing. Oth- 
ers are onomatopoetic and difficult to translate. 
These cases must be solved by techniques other 
than those described here. 

The most desirable feature of an automatic back- 
transliterator is accuracy. If possible, our techniques 
should also be: 

• portable to new language pairs like Ara- 
bic/English with minimal effort, possibly 
reusing resources. 

• robust against errors introduced by optical 
character recognition. 

• relevant to speech recognition situations in 
which the speaker has a heavy foreign accent. 

• able to take textual (topical/syntactic) context 
into account, or at least be able to return a 
ranked list of possible English translations. 

Like most problems in computational linguistics, 
this one requires full world knowledge for a 100%
solution. Choosing between Katarina and Catalina
(both good guesses for カタリナ) might even require
detailed knowledge of geography and figure skating. 
At that level, human translators find the problem 
quite difficult as well, so we only aim to match or 
possibly exceed their performance. 

2 A Modular Learning Approach 

Bilingual glossaries contain many entries mapping
katakana phrases onto English phrases, e.g.,
(aircraft carrier → エアクラフトキャリア). It is possible
to automatically analyze such pairs to gain enough
knowledge to accurately map new katakana phrases
that come along, and this learning approach travels
well to other language pairs. However, a naive approach
to finding direct correspondences between English 
letters and katakana symbols suffers from a number 
of problems. One can easily wind up with a sys- 
tem that proposes iskrym as a back-transliteration of 
aisukuriimu. Taking letter frequencies into account 
improves this to a more plausible-looking isclim. 
Moving to real words may give is crime: the i cor- 
responds to ai, the s corresponds to su, etc. Unfor- 
tunately, the correct answer here is ice cream. Af- 
ter initial experiments along these lines, we decided 
to step back and build a generative model of the 
transliteration process, which goes like this: 

1. An English phrase is written. 

2. A translator pronounces it in English. 

3. The pronunciation is modified to fit the 
Japanese sound inventory. 

4. The sounds are converted into katakana. 

5. Katakana is written. 

This divides our problem into five sub-problems. 
Fortunately, there are techniques for coordinating 
solutions to such sub-problems, and for using gen- 
erative models in the reverse direction. These tech- 
niques rely on probabilities and Bayes' Rule. Sup- 
pose we build an English phrase generator that pro- 
duces word sequences according to some probability 
distribution P(w). And suppose we build an English
pronouncer that takes a word sequence and assigns
it a set of pronunciations, again probabilistically,
according to some P(p|w). Given a pronunciation p,
we may want to search for the word sequence w that
maximizes P(w|p). Bayes' Rule lets us equivalently
maximize P(w) · P(p|w), exactly the two
distributions we have modeled.
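
As a rough illustration of the Bayes' Rule argument above, the following Python sketch picks the word sequence w that maximizes P(w) · P(p|w) for a given pronunciation p. The two candidate phrases, the probability values, and the helper name best_word_sequence are invented for illustration; they are not taken from our models.

```python
# Toy noisy-channel decoder: argmax over w of P(w) * P(p|w).
# All numbers below are made-up illustrative values.

# Hypothetical prior P(w) over word sequences.
P_w = {"ice cream": 0.6, "I scream": 0.4}

# Hypothetical channel model P(p|w): pronunciations for each word sequence.
P_p_given_w = {
    "ice cream": {"AY S K R IY M": 0.9, "AY S PAUSE K R IY M": 0.1},
    "I scream": {"AY S K R IY M": 0.7, "AY PAUSE S K R IY M": 0.3},
}


def best_word_sequence(p):
    """Return (best w, all scores), where score(w) = P(w) * P(p|w).

    By Bayes' Rule the best-scoring w also maximizes P(w|p),
    because P(p) is the same for every candidate w.
    """
    scores = {w: P_w[w] * P_p_given_w[w].get(p, 0.0) for w in P_w}
    return max(scores, key=scores.get), scores


best, scores = best_word_sequence("AY S K R IY M")
print(best, scores)
```

The denominator P(p) never has to be computed, which is why only the two modeled distributions are needed.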

Extending this notion, we settled down to build 
five probability distributions: 

1. P(w): generates written English word sequences.

2. P(e|w): pronounces English word sequences.

3. P(j|e): converts English sounds into Japanese
sounds.

4. P(k|j): converts Japanese sounds to katakana
writing.

5. P(o|k): introduces misspellings caused by optical
character recognition (OCR).

Given a katakana string o observed by OCR, we 
want to find the English word sequence w that max- 
imizes the sum, over all e, j, and k, of 

P(w) · P(e|w) · P(j|e) · P(k|j) · P(o|k)

Following (Pereira et al., 1994; Pereira and Riley,
1996), we implement P(w) in a weighted finite-state
acceptor (WFSA) and we implement the other
distributions in weighted finite-state transducers
(WFSTs). A WFSA is a state/transition diagram with
weights and symbols on the transitions, making
some output sequences more likely than others. A
WFST is a WFSA with a pair of symbols on each
transition, one input and one output. Inputs and
outputs may include the empty symbol ε. Also
following (Pereira and Riley, 1996), we have
implemented a general composition algorithm for
constructing an integrated model P(x|z) from models
P(x|y) and P(y|z), treating WFSAs as WFSTs with
identical inputs and outputs. We use this to combine
an observed katakana string with each of the models
in turn. The result is a large WFSA containing
all possible English translations. We use Dijkstra's
shortest-path algorithm (Dijkstra, 1959) to extract
the most probable one.
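
The extraction step can be sketched in Python as follows. This is only a minimal illustration, assuming a WFSA stored as a dictionary of arcs with probabilities and a single final state; the function name best_path, the data layout, and the toy arc weights are assumptions made for this example, not a description of our implementation. Probabilities are turned into costs with -log, so the cheapest path is the most probable one.

```python
import heapq
import math


def best_path(arcs, start, final):
    """Dijkstra's algorithm over a WFSA.

    arcs maps a state to a list of (next_state, symbol, probability).
    Returns (symbols on the most probable path, its probability),
    or (None, 0.0) if the final state cannot be reached.
    """
    heap = [(0.0, start, [])]  # entries are (cost so far, state, symbols so far)
    done = set()
    while heap:
        cost, state, symbols = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state == final:
            return symbols, math.exp(-cost)
        for nxt, sym, prob in arcs.get(state, []):
            if prob > 0 and nxt not in done:
                heapq.heappush(heap, (cost - math.log(prob), nxt, symbols + [sym]))
    return None, 0.0


# Toy acceptor with invented weights: two word choices at each position.
arcs = {
    0: [(1, "ice", 0.6), (1, "is", 0.4)],
    1: [(2, "cream", 0.7), (2, "crime", 0.3)],
}
symbols, prob = best_path(arcs, 0, 2)
print(symbols, prob)
```

Because all costs are non-negative, Dijkstra's algorithm applies directly; the first time the final state is popped from the queue, its path is optimal.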

The approach is modular. We can test each en- 
gine independently and be confident that their re- 
sults are combined correctly. We do no pruning, 
so the final WFSA contains every solution, however 
unlikely. The only approximation is the Viterbi one, 
which searches for the best path through a WFSA 
instead of the best sequence (i.e., the same sequence 
does not receive bonus points for appearing more 
than once). 

3 Probabilistic Models 

This section describes how we designed and built 
each of our five models. For consistency, we continue 
to print written English word sequences in italics 
(golf ball), English sound sequences in all capitals 
(G AA L F B AO L), Japanese sound sequences in 
lower case (goruhubooru) and katakana
sequences naturally (ゴルフボール).

3.1 Word Sequences 

The first model generates scored word sequences, 
the idea being that ice cream should score higher 
than ice creme, which should score higher than 
aice kreem. We adopted a simple unigram scor- 
ing method that multiplies the scores of the known 
words and phrases in a sequence. Our 262,000-entry 
frequency list draws its words and phrases from the 
Wall Street Journal corpus, an online English name
list, and an online gazetteer of place names.² A
portion of the WFSA looks like this:

[WFSA fragment: arcs labeled los / 0.000087,
federal / 0.0013, angeles, and month / 0.000992]

An ideal word sequence model would look a bit 
different. It would prefer exactly those strings 
which are actually grist for Japanese translitera- 
tors. For example, people rarely transliterate aux- 
iliary verbs, but surnames are often transliterated. 
We have approximated such a model by removing 
high-frequency words like has, an, are, am, were, 
their, and does, plus unlikely words corresponding 
to Japanese sound bites, like coup and oh. 

We also built a separate word sequence model con- 
taining only English first and last names. If we know 
(from context) that the transliterated phrase is a 
personal name, this model is more precise. 
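
The unigram scoring idea of this section can be sketched in a few lines of Python. The probabilities below are invented, and the small floor value for unknown words is an assumption added so that the example can score any string; our WFSA simply does not accept words outside its frequency list.

```python
from math import prod

# Hypothetical unigram probabilities (illustrative values only).
unigram = {"ice": 0.0004, "cream": 0.0002, "creme": 0.00001}

# Assumed floor for words missing from the list (not part of the model above).
UNKNOWN = 1e-9


def score(words):
    """Multiply the unigram scores of the words in a sequence."""
    return prod(unigram.get(w, UNKNOWN) for w in words)


for phrase in ("ice cream", "ice creme", "aice kreem"):
    print(phrase, score(phrase.split()))
```

With these toy numbers, ice cream scores higher than ice creme, which in turn scores higher than aice kreem, matching the ordering described at the start of this section.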

3.2 Words to English Sounds 

The next WFST converts English word sequences 
into English sound sequences. We use the English 
phoneme inventory from the online CMU Pronunciation
Dictionary,³ minus the stress marks. This gives
a total of 40 sounds, including 14 vowel sounds (e.g.,
AA, AE, UW), 25 consonant sounds (e.g., K, HH, R), plus
our special symbol (PAUSE). The dictionary has
pronunciations for 110,000 words, and we organized a
phoneme-tree based WFST from it:




[Figure: fragment of the phoneme-tree WFST]



Note that we insert an optional PAUSE between word 
pronunciations. Due to memory limitations, we only 
used the 50,000 most frequent words. 

We originally thought to build a general
letter-to-sound WFST, on the theory that while wrong
(overgeneralized) pronunciations might occasionally
be generated, Japanese transliterators also
mispronounce words. However, our letter-to-sound WFST
did not match the performance of Japanese
transliterators, and it turns out that mispronunciations are
modeled adequately in the next stage of the cascade.

² Available from the ACL Data Collection Initiative.
³ http://www.speech.cs.cmu.edu/cgi-bin/cmudict

3.3 English Sounds to Japanese Sounds 

Next, we map English sound sequences onto 
Japanese sound sequences. This is an inherently 
information-losing process, as English R and L 
sounds collapse onto Japanese r, the 14 English 
vowel sounds collapse onto the 5 Japanese vowel 
sounds, etc. We face two immediate problems: 

1. What is the target Japanese sound inventory? 

2. How can we build a WFST to perform the
sequence mapping?

An obvious target inventory is the Japanese
syllabary itself, written down in katakana (e.g., ニ) or
a roman equivalent (e.g., ni). With this approach,
the English sound K corresponds to one of カ (ka),
キ (ki), ク (ku), ケ (ke), or コ (ko), depending on
its context. Unfortunately, because katakana is a
syllabary, we would be unable to express an obvi- 
ous and useful generalization, namely that English 
K usually corresponds to Japanese k, independent of 
context. Moreover, the correspondence of Japanese 
katakana writing to Japanese sound sequences is not 
perfectly one-to-one (see next section), so an inde- 
pendent sound inventory is well-motivated in any 
case. Our Japanese sound inventory includes 39 
symbols: 5 vowel sounds, 33 consonant sounds (in- 
cluding doubled consonants like kk), and one spe- 
cial symbol (pause). An English sound sequence 
like (P R OW PAUSE S AA K ER) might map onto a
Japanese sound sequence like (p u r o pause s a 
kk a a). Note that long Japanese vowel sounds are 
written with two symbols (a a) instead of just one 
(aa). This scheme is attractive because Japanese 
sequences are almost always longer than English se- 
quences. 

Our WFST is learned automatically from 8,000
pairs of English/Japanese sound sequences, e.g.,
((S AA K ER) ↔ (s a kk a a)). We were able to
produce these pairs by manipulating a small
English-katakana glossary. For each glossary entry, we
converted English words into English sounds using
the previous section's model, and we converted
katakana words into Japanese sounds using the next
section's model. We then applied the
expectation-maximization (EM) algorithm (Baum, 1972) to
generate symbol-mapping probabilities, shown in
Figure 1. Our EM training goes like this:

1. For each English/Japanese sequence pair,
compute all possible alignments between their
elements. In our case, an alignment is a drawing
that connects each English sound with one or
more Japanese sounds, such that all Japanese
sounds are covered and no lines cross. For
example, there are two ways to align the pair
((L OW) ↔ (r o o)):



   L   OW          L    OW
   |   /\          /\   |
   r  o  o        r  o  o

2. For each pair, assign an equal weight to each 
of its alignments, such that those weights sum 
to 1. In the case above, each alignment gets a 
weight of 0.5. 

3. For each of the 40 English sounds, count up in- 
stances of its different mappings, as observed in 
all alignments of all pairs. Each alignment con- 
tributes counts in proportion to its own weight. 

4. For each of the 40 English sounds, normalize the 
scores of the Japanese sequences it maps to, so 
that the scores sum to 1. These are the symbol- 
mapping probabilities shown in Figure 1. 

5. Recompute the alignment scores. Each align- 
ment is scored with the product of the scores of 
the symbol mappings it contains. 

6. Normalize the alignment scores. Scores for each 
pair's alignments should sum to 1. 

7. Repeat 3-6 until the symbol-mapping probabil- 
ities converge. 
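
The seven training steps above can be sketched in Python roughly as follows. This is a simplified illustration rather than our actual code: it runs a fixed number of iterations instead of testing for convergence, it enumerates alignments by brute force, and the function names alignments and em_train as well as the three toy training pairs are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def alignments(eng, jap):
    """Step 1: split jap into len(eng) non-empty contiguous chunks, in order.

    Each result is a list of (English sound, tuple of Japanese sounds).
    Returns [] when no alignment exists (fewer Japanese than English sounds).
    """
    n, m = len(eng), len(jap)
    if n == 0 or m < n:
        return []
    result = []
    for cuts in combinations(range(1, m), n - 1):
        bounds = (0,) + cuts + (m,)
        result.append(
            [(eng[i], tuple(jap[bounds[i]:bounds[i + 1]])) for i in range(n)]
        )
    return result


def em_train(pairs, iterations=10):
    """Steps 2-7, with a fixed iteration count standing in for convergence."""
    all_aligns = [a for a in (alignments(e, j) for e, j in pairs) if a]
    # Step 2: equal weights that sum to 1 for each pair.
    weights = [[1.0 / len(a)] * len(a) for a in all_aligns]
    probs = {}
    for _ in range(iterations):
        # Step 3: weighted counts of each English-to-Japanese mapping.
        counts = defaultdict(lambda: defaultdict(float))
        for aligns, ws in zip(all_aligns, weights):
            for align, w in zip(aligns, ws):
                for e, js in align:
                    counts[e][js] += w
        # Step 4: normalize per English sound.
        probs = {
            e: {js: c / sum(d.values()) for js, c in d.items()}
            for e, d in counts.items()
        }
        # Steps 5-6: rescore each alignment, then normalize per pair.
        weights = []
        for aligns in all_aligns:
            scores = []
            for align in aligns:
                s = 1.0
                for e, js in align:
                    s *= probs[e][js]
                scores.append(s)
            total = sum(scores)
            weights.append([s / total for s in scores])
    return probs


pairs = [
    (["L", "OW"], ["r", "o", "o"]),
    (["L"], ["r"]),
    (["OW"], ["o", "o"]),
]
probs = em_train(pairs)
print(probs["L"], probs["OW"])
```

On this toy data the unambiguous pairs pull the ambiguous pair toward the alignment L to (r), OW to (o o). A pair whose Japanese side is shorter than its English side has no alignment and is silently skipped.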

We then build a WFST directly from the
symbol-mapping probabilities:

[WFST fragment: arcs labeled PAUSE:pause,
AA:a / 0.024, AA:o / 0.018, and AA:a / 0.382]

Our WFST has 99 states and 283 arcs.

We have also built models that allow individual
English sounds to be "swallowed" (i.e., produce zero
Japanese sounds). However, these models are
expensive to compute (many more alignments) and
lead to a vast number of hypotheses during WFST
composition. Furthermore, in disallowing
"swallowing," we were able to automatically remove
hundreds of potentially harmful pairs from our
training set, e.g., ((B AA R B ER SH AA P) ↔ (b a a
b a a)). Because no alignments are possible, such
pairs are skipped by the learning algorithm; cases
like these must be solved by dictionary lookup
anyway. Only two pairs failed to align when we wished
they had; both involved turning English Y UW into
Japanese u, as in ((Y UW K AH L EY L IY) ↔ (u
k u r e r e)).

Note also that our model translates each English
sound without regard to context. We have also built
context-based models, using decision trees recoded
as WFSTs. For example, at the end of a word,
English T is likely to come out as (t o) rather than (t).
However, context-based models proved unnecessary
for back-transliteration.⁴ They are more useful for
English-to-Japanese forward transliteration.

Figure 1: English sounds (in capitals) with probabilistic mappings to Japanese sound sequences (in lower
case), as learned by expectation-maximization. Only mappings with conditional probabilities greater than
1% are shown, so the figures may not sum to 1.

3.4 Japanese sounds to Katakana 

To map Japanese sound sequences like (m o o t
a a) onto katakana sequences like (モーター), we
manually constructed two WFSTs. Composed to- 
gether, they yield an integrated WFST with 53 
states and 303 arcs. The first WFST simply merges 
long Japanese vowel sounds into new symbols aa, ii, 
uu, ee, and oo. The second WFST maps Japanese 
sounds onto katakana symbols. The basic idea is 
to consume a whole syllable worth of sounds before 
producing any katakana, e.g.: 

[Figure: fragment of the sound-to-katakana WFST]




This fragment shows one kind of spelling variation
in Japanese: long vowel sounds (oo) are usually
written with a long vowel mark (オー) but are
sometimes written with repeated katakana.
We combined corpus analysis with guidelines from 
a Japanese textbook (Jorden and Chaplin, 1976) 
to turn up many spelling variations and unusual 
katakana symbols: 

• the sound sequence (j i) is usually written ジ,
but occasionally ヂ.

• (g u a) is usually グア, but occasionally グァ.

• (w o o) is variously ウオー, ウォー, or with a
special, old-style katakana for wo.

• (y e) may be エ, イエ, or イェ.

• (w i) is either ウイ or ウィ.

• (n y e) is a rare sound sequence, but is written
ニェ when it occurs.

• (t y u) is rarer than (ch y u), but is written
テュ when it occurs.

and so on. 
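
The first of the two WFSTs in this section, which merges long vowels into single symbols, behaves roughly like the Python function below. This is a sketch under an assumption: it pairs identical adjacent vowels greedily from left to right, whereas a WFST could also keep alternative segmentations. The function name merge_long_vowels is hypothetical.

```python
VOWELS = {"a", "i", "u", "e", "o"}


def merge_long_vowels(sounds):
    """Merge two adjacent identical vowels into one long-vowel symbol.

    Pairs are formed greedily from left to right (an assumption of this sketch).
    """
    out = []
    i = 0
    while i < len(sounds):
        s = sounds[i]
        if s in VOWELS and i + 1 < len(sounds) and sounds[i + 1] == s:
            out.append(s + s)  # e.g. "o", "o" becomes "oo"
            i += 2
        else:
            out.append(s)
            i += 1
    return out


print(merge_long_vowels("m o o t a a".split()))
```

The second WFST would then map each syllable of the merged sequence onto one or more katakana spellings.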

Spelling variation is clearest in cases where an
English word like switch shows up transliterated
variously (スイッチ, スウィッチ, スウイッチ) in different
dictionaries. Treating these variations as an
equivalence class enables us to learn general sound
mappings even if our bilingual glossary adheres to a
single narrow spelling convention. We do not, however,
generate all katakana sequences with this model;
for example, we do not output strings that begin
with a subscripted vowel katakana. So this model
also serves to filter out some ill-formed katakana
sequences, possibly proposed by optical character
recognition.

⁴ And harmfully restrictive in their unsmoothed
incarnations.

3.5 Katakana to OCR 

Perhaps uncharitably, we can view optical character 
recognition (OCR) as a device that garbles perfectly 
good katakana sequences. Typical confusions made
by our commercial OCR system involve visually
similar symbols, such as フ for プ. To generate
pre-OCR text, we collected 19,500 characters worth of
katakana words, stored them in a file, and printed 
them out. To generate post-OCR text, we OCR'd 
the printouts. We then ran the EM algorithm to de- 
termine symbol-mapping ("garbling") probabilities. 
Here is part of that table: 
This model outputs a superset of the 81 katakana
symbols, including spurious quote marks, alphabetic 
symbols, and the numeral 7. 

4 Example 

We can now use the models to do a sample back- 
transliteration. We start with a katakana phrase 
as observed by OCR. We then serially compose it 
with the models, in reverse order. Each intermedi- 
ate stage is a WFSA that encodes many possibilities. 
The final stage contains all back-transliterations sug- 
gested by the models, and we finally extract the best 
one. 

We start with the masutaazutoonamento problem
from Section 1. Our OCR observes:

マスクーズトーチメント

This string has two recognition errors: ク (ku)
for タ (ta), and チ (chi) for ナ (na). We turn the
string into a chained 12-state/11-arc WFSA and
compose it with the P(o|k) model. This yields a
fatter 12-state/15-arc WFSA, which accepts the
correct spelling at a lower probability. Next comes
the P(k|j) model, which produces a 28-state/31-arc
WFSA whose highest-scoring sequence is:

masutaazutoochimento 

Next comes P(j|e), yielding a 62-state/241-arc
WFSA whose best sequence is:

M AE S T AE AE DH UH T AO AO CH IH M EH N T AO 



Next to last comes P(e|w), which results in a
2982-state/4601-arc WFSA whose best sequence (out of
myriads) is:

masters tone am ent awe 

This English string is closest phonetically to the
Japanese, but we are willing to trade phonetic
proximity for more sensical English; we rescore this
WFSA by composing it with P(w) and extract the
best translation:

masters tournament 

(Other Section 1 examples are translated correctly 
as earth day and robert sean leonard.) 

5 Experiments 

We have performed two large-scale experiments, one
using a full-language P(w) model, and one using a
personal name language model.

In the first experiment, we extracted 1449 unique
katakana phrases from a corpus of 100 short news
articles. Of these, 222 were missing from an
online 100,000-entry bilingual dictionary. We
back-transliterated these 222 phrases. Many of the
translations are perfect: technical program, sex scandal,
omaha beach, new york times, ramon diaz. Others
are close: tanya harding, nickel simpson, danger
washington, world cap. Some miss the mark: nancy
care again, plus occur, patriot miss real. While it
is difficult to judge overall accuracy (some of the
phrases are onomatopoetic, and others are simply too
hard even for good human translators), it is easier
to identify system weaknesses, and most of these lie
in the P(w) model. For example, nancy kerrigan
should be preferred over nancy care again.

In a second experiment, we took katakana
versions of the names of 100 U.S. politicians,
e.g.: ジョン・ブロー (jyon.buroo), アルホンス・ダマット
(arhonsu.damatto), and マイク・デワイン
(maiku.dewain). We back-transliterated these by
machine and asked four human subjects to do the
same. These subjects were native English speakers
and news-aware; we gave them brief instructions,
examples, and hints. The results were as follows:

                                     human   machine

correct                               27%      64%
  (e.g., spencer abraham /
  spencer abraham)

phonetically equivalent,               7%      12%
  but misspelled
  (e.g., richard brian /
  richard bryan)

incorrect                             66%      24%
  (e.g., olin hatch /
  orren hatch)



There is room for improvement on both sides. Be- 
ing English speakers, the human subjects were good 
at English name spelling and U.S. politics, but not 
at Japanese phonetics. A native Japanese speaker 
might be expert at the latter but not the former. 
People who are expert in all of these areas, however, 
are rare. 

On the automatic side, many errors can be corrected.
A first-name/last-name model would rank
richard bryan more highly than richard brian. A
bigram model would prefer orren hatch over olin hatch.
Other errors are due to unigram training problems,
or more rarely, incorrect or brittle phonetic models.
For example, "Long" occurs much more often than
"Ron" in newspaper text, and our word selection
does not exclude phrases like "Long Island." So we
get long wyden instead of ron wyden.

Still, the machine's performance is impressive.
When word separators ( • ) are removed from the
katakana phrases, rendering the task exceedingly
difficult for people, the machine's performance is
unchanged. When we use OCR, 7% of katakana tokens
are mis-recognized, affecting 50% of test strings, but
accuracy only drops from 64% to 52%.

6 Discussion 

We have presented a method for automatic back- 
transliteration which, while far from perfect, is 
highly competitive. It also achieves the objectives 
outlined in Section 1. It ports easily to new
language pairs; the P(w) and P(e|w) models are entirely
reusable, while other models are learned
automatically. It is robust against OCR noise, in a rare
example of high-level language processing being useful
(necessary, even) in improving low-level OCR.

We plan to replace our shortest-path extraction 
algorithm with one of the recently developed k- 
shortest path algorithms (Eppstein, 1994). We will 
then return a ranked list of the k best translations 
for subsequent contextual disambiguation, either by 
machine or as part of an interactive man-machine 
system. We also plan to explore probabilistic models 
for Arabic/English transliteration. Simply identify- 
ing which Arabic words to transliterate is a difficult 
task in itself; and while Japanese tends to insert ex- 
tra vowel sounds, Arabic is usually written without 
any (short) vowels. Finally, it should also be
possible to embed our phonetic shift model P(j|e)
inside a speech recognizer, to help adjust for a heavy
Japanese accent, although we have not experimented
in this area.